Preparations

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
  install.packages("circlize")
if (!requireNamespace("gprofiler2", quietly = TRUE))
  install.packages("gprofiler2")
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
} 
if (!requireNamespace("edgeR", quietly = TRUE)) {
  install.packages("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
  install.packages("limma")
}
if(!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
if(!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}
if(!requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
suppressPackageStartupMessages({
library("edgeR")
library("circlize")
library("ComplexHeatmap")
library("gprofiler2")
library("ggplot2")
library("tibble")
library("dplyr")
library("stringr")
library("knitr")
})

Introduction: Recap from assignment 1 and assignment 2

In assignment 1, GEO dataset GSE150621 is chosen and it studies Gestational diabetes (GDM)Exposure to gestational diabetes enriches immune-related pathways in the transcriptome and methylome of human amniocytes(Pinney et al. 2020), which has profound effects on the intrauterine metabolic milieu and is linked to obesity and diabetes in offspring. A nested case-control study was performed in second trimeseter amniocytes matched for offspring sex, maternal race/ethnicity, maternal age, gestational age at amniocentesis, gestational age at birth and gestational diabetes status. Sex-specific RNA-sequencing was completed and gene expression changes were identified.The data was cleaned, mapped, and normalized using a standard pipeline. In assignment 2, the normalized data from a1 was conducted with differential gene expression analysis and threshold over-representation analysis using G profiler(Kolberg et al. 2020). It was found that the top-enriched genesets are involved in the regulation of viral process, Interferon alpha/beta signaling, and Type I interferon induction and signaling during SARS-CoV-2 infection.

Load the data

# Read the normalized counts of GSE150621 into a table
data<-read.table("./GSE150621_finalized_normalized_counts.txt") 
head(data)

Create the samples’ groups

# This block of code reads in a tab-separated text file containing normalized gene expression data from a GEO dataset with accession number GSE150621. The data is stored in a table called "data", and the first few rows are displayed using the "head" function.

samples <- data.frame(
  lapply(colnames(data)[3:16],
         FUN=function(x){
           unlist(strsplit(x, split = "_"))[3]})) 

samples=as.data.frame(t(samples))

samples[which(samples[,1]=="CF"|samples[,1]=="CM"),2]="Control" 
samples[which(samples[,1]=="DF"|samples[,1]=="DM"),2]="Case" 
samples[which(samples[,1]=="CF"|samples[,1]=="DF"),3]="Female"
samples[which(samples[,1]=="CM"|samples[,1]=="DM"),3]="Male" 
colnames(samples)=c("label","Group","Gender")
row.names(samples)=colnames(data)[3:16]

model_design <- model.matrix(~ samples$Group )

expressionMatrix <- as.matrix(data[,3:16])
rownames(expressionMatrix) <-data$refGene
colnames(expressionMatrix) <-colnames(data)[3:16]

Use the EdgeR method to demonstrate the top-hits.

# This block of code creates a model matrix for the differential gene expression analysis using the "Group" column of the "samples" data frame. It also converts the "data" table into a matrix called "expressionMatrix", with rows corresponding to genes and columns corresponding to samples. The row names of "expressionMatrix" are set to the "refGene" column of the "data" table, and the column names are set to the sample labels.
d = DGEList(counts=expressionMatrix, group=samples$Group)

model_design_gender <- model.matrix(~samples$Gender+samples$Group)

d <- estimateDisp(d, model_design_gender)

fit <- glmQLFit(d, model_design_gender)

qlf.pos_vs_neg <- glmQLFTest(fit,coef = "samples$GroupControl")
kable(topTags(qlf.pos_vs_neg), type="html",row.names = FALSE)
logFC logCPM F PValue FDR
-2.8334044 4.854485 43.99107 0.0000071 0.0853025
-2.2059407 4.103374 40.18761 0.0000119 0.0853025
-1.2407730 5.161937 32.67551 0.0000373 0.1649817
-1.5724349 4.958207 30.19268 0.0000567 0.1649817
-2.4987900 4.493680 30.10356 0.0000576 0.1649817
-3.1050278 5.564898 28.96479 0.0000703 0.1679848
-3.0608567 2.455475 25.24204 0.0001409 0.2663249
-1.3097728 3.279000 24.35787 0.0001678 0.2663249
-0.8614091 6.710179 24.25471 0.0001713 0.2663249
-1.1673469 5.915746 23.84979 0.0001859 0.2663249
x
BH
x
samples$GroupControl
x
glm
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
                           adjust.method = "BH",
                           n = nrow(data))

Export the ranked top hits.

# Exports the ranked top hits identified using the EdgeR method into a tab-separated text file called "ranked.rnk".
qlf_output_hits_withgn <- merge(data[,1:2],qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
qlf_output_hits_withgn <- qlf_output_hits_withgn[, c("HGNC_genes", "rank")]
write.table(x=data.frame(genename= qlf_output_hits_withgn$HGNC_genes,F_stat= qlf_output_hits_withgn$rank),
            file="ranked.rnk",sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

Non-thresholded Gene set Enrichment Analysis


Question 1: What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.
I used GSEA (version 4.3.2)(Subramanian et al. 2007) for my non-thresholded gene set enrichment analysis. I used the genesets from Bader’s lab (https://download.baderlab.org/EM_Genesets/current_release/Human/symbol/), and I selected the latest file with GOBP and all pathways “Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt”. In GSEA, I firstly loaded two files: the geneset database from Bader’s lab and the ranks.rnk file, which was generated in the aboved step. Then I ran GSEA using the parameters of * Number of permutation: 1000 * Max size: 500 * Min size: 15

Question 2: Summarize your enrichment results.
For upregulated genes, there are: * 1135 / 5624 gene sets. * 366 gene sets are significantly enriched at nominal pvalue < 5%

For downregulated genes, there are: * 4489 / 5624 gene sets. * 1035 gene sets are significantly enriched at nominal pvalue < 5%

Question 3: How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?
The down-regulated genesets are similar to assignment 2’s result, focusing on the immune response. However, there are many up-regulated protein synthesis which did not occur in the results of A2. This is a straight forward comparision because the top 20 up-regulated pathways are all made up with different kinds of protein, while the ones in A2 focus on other different ones.

Visualize Gene set Enrichment Analysis in Cytoscape


Question 1: Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout.

In Cytoscape(Shannon et al. 2003), I used the enrichmentMap app(Isserlin et al. 2014) 147 nodes and 721 edges were generated in the resulting map. The default threshold for node cutoff of Q-value of 0.01, and edge cutoff of 0.375. The blue nodes are the downreguated genesets and the red nodes are the upregulated genesets.

The below is an overviewed of the whole network.

origin_overview
Figure 1: The basic overview of the geneset pathways generating by the default threshold.

origin
Figure 2: The most interconnected focus of the geneset pathways generating by the default threshold.


Question 2: Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.
I annotated useing the AutoAnnotate app(Kucera et al. 2016), and the default parameters with the size of the node corresponding to the size of the geneset, colour of node corresponds to the phenotype and the p-value, label of the node corresponds to the geneset description, and thickness of the edge corresponds to overlap statistic. The more genes two nodes have in common the thicker the edge.

The following figure is the output generated: annotated
Figure 3: The annotated gene pathways generated using the AutoAnnotate app.


Question 3: Make a publication ready figure - include this figure with proper legends in your notebook.

The following figure is the publication-ready figure: publication_ready
Figure 4: The publication ready figure of the genesets.

legends


Figure 5: The legends for figure 4.


Question 4: Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?
The major theme of the pathway is focued on the immune system defense, which is shown as below. The results fit with the model, and the theme is similar to the original paper’s conclusion.

legends


Figure 6: The collapsed and themed network for my GSEA geneset.

Intepretations

Question 1: Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods?
Yes, in the original paper, it shows that the strong enrichment of inflammation and immune response pathways was present in all 3 analyses of GDM-exposed amniocytes, and the enrichment results support this. In assignment 2, the most enriched pathways in the threshold analysis are:
GO:BP: 3721, with negative regulation of viral process (16) reactome: 623, with Interferon alpha/beta signaling(17)
WikiPathways: 359, with Type I interferon induction and signaling during SARS-CoV-2 infection(10). All these could be found in the GSEA enrichment results. However, there are some up-regulated pathways the GSEA results that did not occur in the A2 thresholded method. These are the pathways of 19 protein synthesis placed in the top-enriched rank of the up-regulated pathways from the GSEA results.

Question 2: Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?
I did not find any external publications which could explain why there are so many up-regulated protein synthesis. But from the original paper discovered that many top expressed genes are interferon proteins, and it might indirectly lead to many different protein synthesis.

Question 3: Choose a specific pathway or theme to investigate in more detail. Why did you choose this pathway or theme? Show the pathway or theme as a gene network or as a pathway diagram. Annotate the network or pathway with your original log fold expression values and p-values to show how it is effected in your model.

I chosed the pathway of interferon signalling because it is important to the paper as it found expression of interferon-stimulated genes was increased in GDM amniocytes. I generated it as a gene network using STRING(Doncheva et al. 2018), as shown below:

legends
Figure 7: The network of interferon signaling.

References

Doncheva, Nadezhda T, John H Morris, Jan Gorodkin, and Lars J Jensen. 2018. “Cytoscape StringApp: Network Analysis and Visualization of Proteomics Data.” Journal of Proteome Research 18 (2): 623–32.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics 32 (18): 2847–49.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in r.” Bioinformatics 30 (19): 2811–12.
Isserlin, Ruth, Daniele Merico, Veronique Voisin, and Gary D Bader. 2014. “Enrichment Map–a Cytoscape App to Visualize and Explore OMICs Pathway Enrichment Results.” F1000Research 3.
Kolberg, Liis, Uku Raudvere, Ivan Kuzmin, Jaak Vilo, and Hedi Peterson. 2020. “Gprofiler2–an r Package for Gene List Functional Enrichment Analysis and Namespace Conversion Toolset g: Profiler.” F1000Research 9.
Kucera, Mike, Ruth Isserlin, Arkady Arkhangorodsky, and Gary D Bader. 2016. “AutoAnnotate: A Cytoscape App for Summarizing Networks with Semantic Annotations.” F1000Research 5 (1717): 1717.
Morgan, Martin. 2022. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.
Müller, Kirill, and Hadley Wickham. 2022. Tibble: Simple Data Frames.
Pinney, Sara E, Apoorva Joshi, Victoria Yin, So Won Min, Cetewayo Rashid, David E Condon, and Paul Zhipang Wang. 2020. “Exposure to Gestational Diabetes Enriches Immune-Related Pathways in the Transcriptome and Methylome of Human Amniocytes.” The Journal of Clinical Endocrinology & Metabolism 105 (10): 3250–64.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504.
Subramanian, Aravind, Heidi Kuehn, Joshua Gould, Pablo Tamayo, and Jill P Mesirov. 2007. “GSEA-p: A Desktop Application for Gene Set Enrichment Analysis.” Bioinformatics 23 (23): 3251–53.
Wickham, Hadley. 2010. “Stringr: Modern, Consistent String Processing.” R J. 2 (2): 38.
———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Maintainer Hadley Wickham. 2017. “Package ‘Tidyr’.” Easily Tidy Data with’spread’and’gather ()’Functions.
Xie, Y. 2016. “Knitr: A General-Purpose Package for Dynamic Report Generation in r (r Package Version 1.15. 1)[computer Software].” Zugriff Am 7 (10): 2016.